Ergodic multigram HMM integrating word segmentation and class tagging for Chinese language modeling
Abstract
A novel Ergodic Multigram Hidden Markov Model (HMM) is introduced that models sentence production as a doubly stochastic process: word classes are first produced according to a first-order Markov model, and single- or multi-character words are then generated independently from those classes, with no word boundaries marked in the sentence. The model can therefore be applied to languages without word boundary markers, such as Chinese. Given a lexicon that lists the syntactic classes of each word, its applications include language modeling for recognizers and integrated word segmentation and class tagging. A pre-segmented and tagged corpus is not needed for training, and both segmentation and tagging are trained in a single model. In this paper, the relevant algorithms for the model are presented, and experimental results on a Chinese news corpus are reported.

Another approach is to keep all possible segmentations in lattice form, score the lattice with a language model, and then retrieve the best candidate by dynamic programming or another search algorithm. N-gram models are usually used for scoring [6] [7], but training them requires the sentences of the corpus to be segmented (and tagged, if a class-based N-gram is used [7]). We introduce the Ergodic Multigram Hidden Markov Model which, when applied as a language model for these languages, integrates the segmentation and tagging processes into one model and does not assume any prior segmentation or class tagging. Thus both training and scoring can be done with the model directly on a raw corpus. The model can be applied as a language model to an input sentence (or a character lattice), and the maximum-likelihood segmentation and class tagging of that input can be obtained using the Viterbi or Stack Decoding Algorithm.
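The joint decoding described above can be illustrated with a minimal Python sketch. It assumes a class-to-class transition table, a lexicon of per-class word probabilities, and a maximum word length. The function name `viterbi_segment_tag`, the two classes `N` and `V`, and all numbers are hypothetical toy values chosen for illustration (they are not normalized distributions and do not come from the paper), and training is not shown.

```python
import math

def viterbi_segment_tag(sentence, lexicon, init, trans, max_word_len=4):
    """Return the most likely [(word, class), ...] for an unsegmented string.

    lexicon: {word: {cls: P(word | cls)}}   (hypothetical toy values)
    init:    {cls: P(cls at sentence start)}
    trans:   {(prev_cls, cls): P(cls | prev_cls)}
    """
    n = len(sentence)
    # best[end][cls] = (log prob, start of last word, class of previous word)
    best = [dict() for _ in range(n + 1)]
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = sentence[start:end]
            for cls, p_emit in lexicon.get(word, {}).items():
                if start == 0:
                    p = init.get(cls, 0.0)
                    candidates = [(math.log(p), None)] if p > 0 else []
                else:
                    candidates = [
                        (score + math.log(trans[(prev, cls)]), prev)
                        for prev, (score, _, _) in best[start].items()
                        if trans.get((prev, cls), 0.0) > 0
                    ]
                for log_p, prev in candidates:
                    total = log_p + math.log(p_emit)
                    if cls not in best[end] or total > best[end][cls][0]:
                        best[end][cls] = (total, start, prev)
    if not best[n]:
        return None  # no segmentation is covered by the lexicon
    # Backtrace from the best class ending at the last character.
    cls = max(best[n], key=lambda c: best[n][c][0])
    end, path = n, []
    while end > 0:
        _, start, prev = best[end][cls]
        path.append((sentence[start:end], cls))
        end, cls = start, prev
    return path[::-1]

# Toy model (assumed values, for illustration only).
INIT = {"N": 0.6, "V": 0.4}
TRANS = {("N", "N"): 0.3, ("N", "V"): 0.7, ("V", "N"): 0.8, ("V", "V"): 0.2}
LEXICON = {
    "研究": {"V": 0.3, "N": 0.1},
    "研究生": {"N": 0.05},
    "生命": {"N": 0.2},
    "命": {"N": 0.01},
    "起源": {"N": 0.1},
}

print(viterbi_segment_tag("研究生命起源", LEXICON, INIT, TRANS))
```

With these toy values the sketch prefers the segmentation 研究 / 生命 / 起源 tagged V, N, N over the competing 研究生 / 命 / 起源, because the word boundaries and the class sequence are scored together in a single pass.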
Similar resources
N-th Order Ergodic Multigram HMM for Modeling of Languages without Marked Word Boundaries
Ergodic HMMs have been successfully used for modeling sentence production. However, for some oriental languages such as Chinese, a word can consist of multiple characters without word boundary markers between adjacent words in a sentence. This makes word segmentation on the training and testing data necessary before an ergodic HMM can be applied as the language model. This paper introduces the ...
A Chinese Efficient Analyser Integrating Word Segmentation, Part-Of-Speech Tagging, Partial Parsing and Full Parsing
This paper introduces an efficient analyser for the Chinese language, which efficiently and effectively integrates word segmentation, part-of-speech tagging, partial parsing and full parsing. The Chinese efficient analyser is based on a Hidden Markov Model (HMM) and an HMM-based tagger. That is, all the components are based on the same HMM-based tagging engine. One advantage of using the same s...
HMM and CRF Based Hybrid Model for Chinese Lexical Analysis
This paper presents the Chinese lexical analysis systems developed by the Natural Language Processing Laboratory at Dalian University of Technology, which were evaluated in the 4th International Chinese Language Processing Bakeoff. The HMM and CRF hybrid model, which combines a character-based model with a word-based model in a directed graph, is adopted in developing the system. Both the closed and open t...
Inducing Word and Part-of-Speech with Pitman-Yor Hidden Semi-Markov Models
We propose a nonparametric Bayesian model for joint unsupervised word segmentation and part-of-speech tagging from raw strings. Extending a previous model for word segmentation, our model is called a Pitman-Yor Hidden Semi-Markov Model (PYHSMM) and can be regarded as a method for building a class n-gram language model directly from strings while integrating character- and word-level information. Experime...
Chinese Lexical Analysis Using Hierarchical Hidden Markov Model
This paper presents a unified approach to Chinese lexical analysis using a hierarchical hidden Markov model (HHMM), which aims to incorporate Chinese word segmentation, part-of-speech tagging, disambiguation and unknown word recognition into a single theoretical framework. A class-based HMM is applied in word segmentation, and at this level unknown words are treated in the same way as common words l...
Publication date: 1996